.Net: Fix TextChunker orphan chunk token counting #14013
Open
MukundaKatta wants to merge 1 commit into
Pull request overview
Fixes a bug in TextChunker.ProcessParagraphs where the “orphan paragraph” merge decision used word counts instead of the configured token-counting logic, which could produce a merged paragraph exceeding maxTokensPerParagraph when a custom tokenCounter is supplied.
Changes:
- Update orphan-merge logic to evaluate the merged candidate using `GetTokenCount(...)` (consistent with the rest of the splitting flow).
- Remove the now-unused `s_spaceChar` constant from `TextChunker`.
- Add a regression unit test using a length-based custom token counter to ensure orphan chunks are not merged beyond the token limit.
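To make the difference between the two checks concrete, here is a minimal, self-contained sketch. It is not the code from this PR: `OrphanMergeSketch`, `CanMergeByWordCount`, `CanMergeByTokenCount`, and the local `TokenCounter` delegate are hypothetical stand-ins, and joining the two paragraphs with a single space is an assumption made for illustration.

```csharp
using System;

// Hypothetical stand-in for a custom token-counter delegate.
public delegate int TokenCounter(string input);

public static class OrphanMergeSketch
{
    // Word-count check: ignores the custom counter entirely,
    // so it can approve a merge that the counter would reject.
    public static bool CanMergeByWordCount(string previous, string orphan, int maxTokens)
    {
        return CountWords(previous) + CountWords(orphan) <= maxTokens;
    }

    // Token-count check: build the merged candidate first, then measure it
    // with the same counter that governs the rest of the splitting.
    public static bool CanMergeByTokenCount(
        string previous, string orphan, int maxTokens, TokenCounter counter)
    {
        string candidate = previous + " " + orphan; // assumed join with one space
        return counter(candidate) <= maxTokens;
    }

    private static int CountWords(string text)
    {
        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    public static void Main()
    {
        // Length-based counter: every character counts as one token.
        TokenCounter lengthCounter = s => s.Length;
        const int MaxTokens = 20;

        string previous = "alpha beta gamma"; // 3 words, 16 characters
        string orphan = "delta";              // 1 word, 5 characters

        // 4 words <= 20, so the word-count check allows the merge.
        Console.WriteLine(CanMergeByWordCount(previous, orphan, MaxTokens)); // True

        // The merged candidate is 22 characters > 20, so the merge is rejected.
        Console.WriteLine(CanMergeByTokenCount(previous, orphan, MaxTokens, lengthCounter)); // False
    }
}
```

With a length-based counter the two checks disagree on the same input, which is the kind of case the regression test described above is meant to cover.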
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| dotnet/src/SemanticKernel.Core/Text/TextChunker.cs | Uses token counting (via GetTokenCount) to validate orphan-paragraph merges, preventing oversized merged chunks with custom token counters. |
| dotnet/src/SemanticKernel.UnitTests/Text/TextChunkerTests.cs | Adds a regression test covering the orphan-merge scenario with a custom token counter. |
Motivation and Context
Fixes #13713.
`TextChunker.ProcessParagraphs` used word counts when deciding whether to glue a small final/orphan paragraph back into the previous paragraph. With a custom token counter, that could merge two paragraphs whose actual token count exceeds `maxTokensPerParagraph`, producing an oversized final chunk.
Description
This changes the orphan merge check to build the candidate merged paragraph and evaluate it with `GetTokenCount(...)`, so the same token-counting logic controls both splitting and final orphan gluing. It also adds a regression test using a custom length-based token counter where the previous word-count check would have produced an oversized merged chunk.
Contribution Checklist
Local verification:
`git diff --check` passes. I could not run `dotnet test dotnet/src/SemanticKernel.UnitTests/SemanticKernel.UnitTests.csproj --filter FullyQualifiedName~TextChunkerTests --no-restore` because this environment does not have the `dotnet` CLI installed.